Extracting vocal sources from master audio recordings
Abstract
Our goal is to separate vocals from background music in single-source songs. We examined prior work in the area ([1][2][3][5][6][7]) and decided to experiment with 2D Adaptive Probabilistic Latent Component Analysis (PLCA) [1][5] because it offers a good trade-off between implementation difficulty and resulting accuracy. The training data for the 2D Adaptive PLCA algorithm is a set of song segments containing only background music. The adaptive algorithm is more practical than plain PLCA because a song will typically have background-music-only segments but no vocal-only segments, which prevents training directly on the vocals. The adaptive version instead learns about the vocal segments dynamically, adding spectral basis vectors to represent the voice while it performs the separation [1][5]. We also introduce an SVM at the input stage to create the training samples for the PLCA algorithm, improving the automation of the process. Perceptually, we are able to separate background music and vocals in a very noticeable way, and we make use of an established method to quantify our results.

Introduction

Semi-blind source separation given a single observation containing multiple sources is a popular and difficult problem. In particular, there has been much work on audio source separation, or, in this work's case, the separation of vocals and background music [1][2][3][6][7]; however, there are few actual products on the market. Motivating examples include editing previously mixed songs, creating karaoke tracks, and removing undesirable noise.

Figure 1 gives an overview of our proposed algorithm. First, a support vector machine (SVM) is used to label the sections of a song that contain vocals and the sections that are background music only. This improves upon prior work by allowing the separation of a large class of songs after the SVM has been trained to classify vocal-containing segments in a particular band or genre. The labeled song is then passed to the PLCA algorithm, which, after computing the Short-Time Fourier Transform (STFT) magnitude, also known as the spectrogram, uses Expectation Maximization (EM) to learn the spectral signature of the background. This spectral signature is then used with the same PLCA algorithm to estimate the spectral signature of the vocals (along with a time signature for the whole mixture). We can then extract both the background and the vocals by projecting the spectrogram onto the learned bases. The following sections describe the SVM and the PLCA algorithm in detail and summarize our results.

Figure 1: Overview of dataflow in our implementation.

SVM

Creating training data for PLCA is tedious, since it involves manually classifying song segments in every song being separated. To improve the scalability of this process, we experimented with training an SVM to automate this step. It is possible to simply train PLCA directly on the training data used for the SVM; however, PLCA does not perform well in this case because that training data is not local enough to accurately represent the background in the section being separated. By using an SVM at the input stage, we guarantee that the training data is localized to the section being separated.

Feature Representation

We experimented with two potential input feature vectors. Based on prior work in the field of audio source separation, we concluded that spectrogram-based approaches are the most promising, and we focused our attention on generating an input vector from spectrograms of song segments with and without vocals. In our first approach, we reshaped the spectrogram matrix into a vector and used it directly as input to the SVM. We were surprised that such a simple approach yielded 85% accuracy with certain parameters.
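As a concrete illustration, the raw-spectrogram feature for one labeled segment can be computed along the following lines. This is a minimal Matlab sketch rather than our exact code: the file name is hypothetical, and decimate, spectrogram, and hann assume the Signal Processing Toolbox.

[x, fs] = audioread('segment.wav');  % one labeled 1-second song segment (hypothetical file)
x = mean(x, 2);                      % mix down to mono
x = decimate(x, 8);                  % optional down-sampling (see Decimation below)
% Magnitude spectrogram: 1024 FFT points, 1024-point Hanning window, 1/2 overlap
S = abs(spectrogram(x, hann(1024), 512, 1024));
feat = reshape(S, 1, []);            % flatten the matrix into a single SVM input row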
We also experimented with re-scaling the power spectrum according to the mel frequency scale. The mel scale is derived to be a natural scale with regard to human auditory and speech processes; in effect, it is a logarithmic re-mapping of the original power spectrum's frequency axis. We chose the mel basis because many speech-from-noise separation algorithms also make use of it. It also achieves a significant reduction in training-example dimensionality: while a typical audio power spectrum contains frequency values between 20 Hz and 20,000 Hz, the corresponding mel representation contains roughly 50 numbers (Figure 2). With this we were able to achieve nearly 90% accuracy (Figure 3).

Figure 2: A comparison between the standard time-frequency scale and the corresponding mel-frequency scale. We were able to reduce the dimensionality of our problem significantly by using the mel components.

In addition to using the mel spectrum as feature variables, we also computed spectral properties such as the flatness, the brightness, and the root-mean-squared amplitude. In doing so we were able to capture a wealth of information about a power spectrum in a relatively low-dimensional space, which significantly improves SVM training efficiency.

To get the training data, we take a library of songs, randomly select a song, and randomly select a 1-second segment from it. We then manually determine whether the segment has vocals or not. These labeled segments are the inputs to the SVM. This process is mostly automated and very rapid; hundreds of training examples can be gathered in minutes.

Accuracy Evaluation

For estimating prediction accuracy, we used 10-fold cross-validation. Our preliminary training library is the first 10 songs from the Beatles album "Rubber Soul." We observed the effects of using different kernels in the SVM as well as the effects of down-sampling the audio segments. We also experimented with changing the number of FFT points and using various windows, but found no significant differences in the results when varying these parameters.

Decimation

We observed the effects of down-sampling the audio segments that are chosen as training examples and fed into the training algorithm. The audio is sampled at 44.1 kHz. By decimating the original audio we were able to achieve higher SVM prediction accuracy, until the decimation factor gets large enough that important information is lost (Figure 2).

Figure 2 (Left): Effect of decimation factor on accuracy with different kernels, using 1024 FFT points, a 1024-point Hanning window, 1/2 overlap, training on "Closer to You," and 10-fold cross-validation to determine prediction accuracy.

Figure 3 (Right): Effect of decimation factor on accuracy for a quadratic kernel, using a vector of the mel components, amplitude, flatness, and brightness on a spectrogram with 4096 FFT points, a 4096-point Hanning window, 2/3 overlap, training on a collection of Beatles songs, and 10-fold cross-validation to determine prediction accuracy.

Kernels

We experimented with linear, quadratic, and Gaussian kernels (with varying width). We found that the width of the Gaussian kernel does not play a large role in determining the accuracy of the SVM, and that the Gaussian kernel performs poorly overall. We achieve the best accuracy with a quadratic kernel (Figure 2).
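Putting the pieces together, the feature pipeline and the cross-validated quadratic-kernel SVM fit together roughly as follows. This is a hedged Matlab sketch, not our exact implementation: segfeatures and melbank are hypothetical helpers (the filterbank is a simplified stand-in for any preferred mel implementation), the 1500 Hz brightness cutoff is an illustrative choice, and fitcsvm, crossval, and kfoldLoss assume the Statistics and Machine Learning Toolbox on a recent Matlab.

% X: one feature row per labeled 1-second segment (built with segfeatures below);
% y: the manual vocals / no-vocals labels gathered as described above
mdl = fitcsvm(X, y, 'KernelFunction', 'polynomial', 'PolynomialOrder', 2, ...
              'Standardize', true);       % quadratic kernel
cv  = crossval(mdl, 'KFold', 10);         % 10-fold cross-validation
acc = 1 - kfoldLoss(cv);                  % estimated prediction accuracy

function v = segfeatures(x, fs)
    % Mel components plus summary spectral properties for one segment
    nfft = 4096;
    S = abs(spectrogram(x, hann(nfft), round(2*nfft/3), nfft)); % 2/3 overlap
    p = mean(S.^2, 2);                        % average power spectrum
    mel = melbank(50, nfft, fs) * p;          % ~50 mel-scale energies
    flatness = geomean(p) / mean(p);          % spectral flatness
    f = (0:nfft/2)' * fs / nfft;              % bin center frequencies
    brightness = sum(p(f > 1500)) / sum(p);   % energy share above 1500 Hz (illustrative)
    v = [mel', flatness, brightness, rms(x)];
end

function M = melbank(nMel, nfft, fs)
    % Simplified triangular mel filterbank, nMel x (nfft/2 + 1)
    m  = linspace(0, 2595*log10(1 + (fs/2)/700), nMel + 2); % equally spaced in mel
    hz = 700*(10.^(m/2595) - 1);                            % back to Hz
    b  = floor(hz/(fs/2)*(nfft/2)) + 1;                     % FFT bin indices
    M  = zeros(nMel, nfft/2 + 1);
    for k = 1:nMel
        M(k, b(k):b(k+1))   = linspace(0, 1, b(k+1) - b(k) + 1);
        M(k, b(k+1):b(k+2)) = linspace(1, 0, b(k+2) - b(k+1) + 1);
    end
end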
2D Adaptive PLCA

We implemented 2D PLCA in Matlab based on [1][5]. The algorithm is an EM algorithm that operates on the magnitude of the spectrogram, S(f,t), where t is time and f is frequency. The spectrogram is normalized and treated as a distribution, such that the sum over all (f,t) pairs (quanta) is one. The equations in Figure 4 are iterated until convergence (typically 100 iterations); refer to [1] and [5] for a detailed explanation of the theory. We later obtained and experimented with a more performance-optimized, but functionally identical, implementation of 2D PLCA from the authors of [1].

Figure 4: The 2D PLCA update equations. F holds the spectral bases, T the temporal activations, and Z the component weights; ∘ denotes element-wise multiplication.

Expectation:
  P := F Z T
  R(f,t) := S(f,t) / P(f,t)

Maximization:
  F := F ∘ (R Tᵀ), then renormalize each column of F to sum to one
  T := T ∘ (Fᵀ R), then renormalize each row of T to sum to one
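For concreteness, the iteration in Figure 4 can be written as a short Matlab function. This is a minimal sketch of the standard PLCA formulation, not the optimized implementation from the authors of [1]; the function name, the initialization, and the eps guards are ours, and the element-wise normalizations rely on implicit expansion, so a recent Matlab is assumed.

function [F, Z, T] = plca(S, K, nIter, Ffixed)
% S: magnitude spectrogram (nFreq x nTime); K: number of spectral bases;
% Ffixed: bases to hold fixed during learning (may be empty)
S = S / sum(S(:));                        % treat the spectrogram as a distribution
[nF, nT] = size(S);
F = rand(nF, K); F = F ./ sum(F, 1);      % spectral bases P(f|z)
T = rand(K, nT); T = T ./ sum(T, 2);      % temporal activations P(t|z)
Z = ones(K, 1) / K;                       % component weights P(z)
nFix = size(Ffixed, 2);
if nFix > 0, F(:, 1:nFix) = Ffixed; end   % e.g. background bases learned earlier
for it = 1:nIter
    P = F * diag(Z) * T;                  % Expectation: P := F Z T
    R = S ./ max(P, eps);                 % R(f,t) := S(f,t) / P(f,t)
    A = F .* (R * T');                    % Maximization: F := F ∘ (R Tᵀ)
    B = T .* (F' * R);                    %               T := T ∘ (Fᵀ R)
    Z = Z .* sum(A, 1)'; Z = Z / sum(Z);  % reweight the components
    F = A ./ max(sum(A, 1), eps);         % renormalize the columns of F
    T = B ./ max(sum(B, 2), eps);         % renormalize the rows of T
    if nFix > 0, F(:, 1:nFix) = Ffixed; end  % keep the fixed bases unchanged
end
end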
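The adaptive, two-stage use described earlier then looks roughly like the following sketch, under stated assumptions: Sbg and Smix are magnitude spectrograms of an SVM-labeled background-only segment and of the mixture to be separated, and the basis counts are illustrative choices, not tuned values.

Kbg = 30; Kv = 15; nIter = 100;           % illustrative parameters
% Stage 1: learn spectral bases for the background from a music-only segment
Fbg = plca(Sbg, Kbg, nIter, []);
% Stage 2: hold Fbg fixed and add Kv new bases that adapt to the vocals
[F, Z, T] = plca(Smix, Kbg + Kv, nIter, Fbg);
% Project the mixture onto each basis set and build a soft mask
Pbg  = F(:, 1:Kbg)     * diag(Z(1:Kbg))     * T(1:Kbg, :);
Pvox = F(:, Kbg+1:end) * diag(Z(Kbg+1:end)) * T(Kbg+1:end, :);
mask = Pvox ./ max(Pbg + Pvox, eps);
Vocals     = mask .* Smix;                % vocal magnitude estimate
Background = (1 - mask) .* Smix;          % background magnitude estimate
% Reapplying the mixture phase and inverting the STFT yields the two waveforms.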
Similar resources
Separation of Vocals from Polyphonic Audio Recordings
Source separation techniques like independent component analysis and the more recent non-negative matrix factorization are gaining widespread use for the monaural separation of individual tracks present in a music sample. The underlying principle behind these approaches characterises only stationary signals and fails to separate nonstationary sources like speech or vocals. In this paper, we mak...
Singing Voice Separation Using Spectro-Temporal Modulation Features
An auditory-perception inspired singing voice separation algorithm for monaural music recordings is proposed in this paper. Under the framework of computational auditory scene analysis (CASA), the music recordings are first transformed into auditory spectrograms. After extracting the spectral-temporal modulation contents of the timefrequency (T-F) units through a two-stage auditory model, we de...
Automatic Labeling of Training Data for Singing Voice Detection in Musical Audio
We present a novel approach to labeling a large amount of training data for vocal/non-vocal discrimination in musical audio with the minimum amount of human labor. To this end, we use MIDI files for which vocal lines are encoded on a separate channel and synthesize them to create audio files. We then align synthesized audio with real recordings using dynamic time warping (DTW) algorithm. Note o...
Perceptual constraints for automatic vocal detection in music recordings
Background in Music Information Retrieval. For many applications in music information retrieval, automatic music structure analysis is a desired capability, including the detection of vocal/sung segments within a musical recording. While there have been several recent studies in the area of automatic vocal detection in music recordings, current performance is not sufficient for all applications...
Nonnegative Tensor Factorization with Frequency Modulation Cues for Blind Audio Source Separation
We present Vibrato Nonnegative Tensor Factorization, an algorithm for single-channel unsupervised audio source separation with an application to separating instrumental or vocal sources with nonstationary pitch from music recordings. Our approach extends Nonnegative Matrix Factorization for audio modeling by including local estimates of frequency modulation as cues in the separation. This permi...
A Real Time Singing Voice Removal System Using DSP and Multichannel Audio Interface
Separating technique for singing voice from music accompaniment is very useful in original sound type Karaoke instrument. We propose a real-time system to separate singing voice from music accompaniment for stereo recordings. Proposed algorithm consists of two stages. The first stage is a spectral change detector. The last stage is a selective vocal separation in frequency bins. Our system cons...
Publication date: 2012